Project description

This project aims to present a different approach to data visualization. It is based on data drom the Palmer Archipelago (Antarctica) and includes information on penguin size, gender, and the islands they live on. The project will contain graphs that show a lot of information and ones that tell almost no story. A brief interpretation will also be added to the charts.


Dataset & libraries loading with quick data cleaning

Variables: species, island and sex was changed to factor type and only complete cases was filtered.

# Setting wd and loading data
setwd("C:/Users/Kasper/OneDrive/Pulpit/Applied Statistics/R Learning/Penguin explorig")
dane <- read.csv("penguins_size.csv")

# Essential libraries
library(ggplot2)
library(tidyverse)
library(dplyr)
library(gridExtra)
library(extrafont)

# Quick data cleaning
dane <- dane %>%
  mutate(species = factor(species), island = factor(island), sex = factor(sex)) %>%
  filter(complete.cases(dane) & dane$sex != ".")

Bar plots examples

Example of bad visualization

In the graph below it is only possible to see the number of penguins by species. In addition, it would be worth expanding this chart.

ggplot(dane, aes(x=species)) + geom_bar()


Examples of better aproach

The next two charts allow you to see new regularities in the dataset. For example penguins from species Adelie live on every island, Gentoo live only on island Biscoe and Chinstrap live only on Dream island. It can also be seen that in each species the number of males and females is almost equal.

dane %>%
  drop_na() %>%
  count(island, species) %>%
  ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  facet_wrap(~island) +
  theme_minimal() +
  ggtitle('Penguins species number by islands') + 
  theme(axis.title.y = element_text(color="Grey23", size=16),
        axis.text.y = element_text(size=10),
        axis.title.x=element_blank(),
        legend.title = element_text(size=12),
        legend.text = element_text(size=8),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=22, family="serif"))


dane %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() + geom_col(aes(x = sex, y = n, fill = sex)) +
  geom_label(aes(x = sex, y = n, label = n)) +
  scale_fill_manual(values = c("dodgerblue2","coral2")) +
  facet_wrap(~species) +
  theme_minimal() +
  ggtitle('Penguins female & male number by species') + 
  theme(axis.title.y = element_text(color="Grey23", size=16),
        axis.text.y = element_text(size=10),
        axis.title.x=element_blank(),
        legend.title = element_text(size=12),
        legend.text = element_text(size=8),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=22, family="serif"))


Histograms examples

Example of bad visualization

The charts below show only the distribution of the variables: flipper_lenth_mm, culmen_depth_mm and flipper_length_mm without the division into qualitative variables.

bad1 <- ggplot(dane, aes(x = culmen_length_mm)) + geom_histogram(binwidth = 2)
bad2 <- ggplot(dane, aes(x = culmen_depth_mm)) + geom_histogram(binwidth = 1)
bad3 <- ggplot(dane, aes(x = flipper_length_mm)) + geom_histogram(binwidth = 5)

grid.arrange(bad1,bad2,bad3)


Examples of better approach

On the other hand, in the histograms below, the distribution of variables has been presented in a more elegant way and the difference between species broken down into islands in which they live has been shown. Also, we can see distribution of flipper_length_mm broken down into sex and species.

a <- ggplot(dane, aes(x = culmen_length_mm, fill=species)) + geom_histogram(binwidth = 2, alpha = 0.7) + facet_grid(~island) + theme_minimal() +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) + 
  ggtitle('Distribution of variables') + 
  theme(axis.title.y = element_text(color="Grey23", size=11),
        axis.text.y = element_text(size=9),
        axis.title.x = element_text(color="Grey23", size=11),
        axis.text.x = element_text(size=9),
        legend.title = element_text(size=10.5),
        legend.text = element_text(size=8.5),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=22, family="serif"))

b <- ggplot(dane, aes(x = culmen_depth_mm, fill=species)) + geom_histogram(binwidth = 1, alpha = 0.7) + facet_grid(~island) + theme_minimal() +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) + 
  theme(axis.title.y = element_text(color="Grey23", size=11),
        axis.text.y = element_text(size=9),
        axis.title.x = element_text(color="Grey23", size=11),
        axis.text.x = element_text(size=9),
        legend.title = element_text(size=10.5),
        legend.text = element_text(size=8.5),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"))

c <- ggplot(dane, aes(x = flipper_length_mm, fill=species)) + geom_histogram(binwidth = 5, alpha = 0.7) + facet_grid(~island) + theme_minimal() +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) + 
  theme(axis.title.y = element_text(color="Grey23", size=11),
        axis.text.y = element_text(size=9),
        axis.title.x = element_text(color="Grey23", size=11),
        axis.text.x = element_text(size=9),
        legend.title = element_text(size=10.5),
        legend.text = element_text(size=8.5),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"))
grid.arrange(a,b,c)


ggplot(dane, aes(x = flipper_length_mm, fill=sex)) + geom_histogram(binwidth = 5, alpha = 0.7) + facet_grid(~species, scales="free") + theme_minimal() +
  scale_fill_manual(values = c("dodgerblue2","coral2")) +
  ggtitle('Distribution of flipper length by sex and species') + 
  theme(axis.title.y = element_text(color="Grey23", size=16),
        axis.text.y = element_text(size=10),
        axis.title.x = element_text(color="Grey23", size=16),
        axis.text.x = element_text(size=10),
        legend.title = element_text(size=12),
        legend.text = element_text(size=8),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=22, family="serif"))


Boxplots examples

Example of bad visualization

This box plot provides some information but could be better represented graphically

ggplot(dane, aes(x = species, y = body_mass_g)) + geom_boxplot()


Examples of better aproach

On the other hand, the box plots below provide more information as it is possible to see the distribution of variables broken down into characteristics such as species and gender.

ggplot(dane, aes(x = species, y = body_mass_g/1000, fill=species)) + geom_jitter() + geom_boxplot(size=1.2, alpha=0.5) + facet_grid(~sex) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) + theme_bw() + ylab("Body weight in kilograms") + 
  ggtitle("Penguins bodyweight by species and sex") +
  theme(axis.title.y = element_text(color="Grey23", size=16),
        axis.text.y = element_text(size=10),
        axis.title.x=element_blank(),
        strip.background = element_rect(fill='lightskyblue1'),
        strip.text = element_text(family="serif", face="bold"),
        legend.title = element_text(size=12),
        legend.text = element_text(size=8),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=22, family="serif"))


ggplot(dane, aes(x = sex, y = culmen_length_mm, fill=sex)) + geom_jitter() + geom_boxplot(size=1.2, alpha=0.5) + facet_grid(~species) +
  scale_fill_manual(values = c("dodgerblue2","coral2")) + theme_bw() + ylab("Culmen length in mm") + 
  ggtitle("Culmen length by species and sex") +
  theme(axis.title.y = element_text(color="Grey23", size=16),
        axis.text.y = element_text(size=10),
        axis.title.x=element_blank(),
        strip.background = element_rect(fill='lightskyblue1'),
        strip.text = element_text(family="serif", face="bold"),
        legend.title = element_text(size=12),
        legend.text = element_text(size=8),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=22, family="serif"))


Scatterplot examples

Example of bad visualization

In this scatterplot you only see the black points scattered around. Some clusters can be seen, but to better visualize them, the graph should be supplemented with some qualitative features

ggplot(dane, aes(x = culmen_length_mm, y = culmen_depth_mm)) + geom_point()


Examples of better aproach

After distinguishing the qualitative features, it is clearly visible in the charts below some relationships between the quantitative variables and the qualitative features

ggplot(dane, aes(x = culmen_length_mm, y = culmen_depth_mm, color=species)) + geom_point(aes(shape=species, size=sex), alpha = 0.8) +
  scale_shape_manual(values=c(18, 16, 17)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  ggtitle("Relationship between culmen length and depth") +
  theme(axis.title.y = element_text(color="Grey23", size=15),
        axis.text.y = element_text(size=12),
        axis.title.x = element_text(color="Grey23", size=15),
        axis.text.x = element_text(size=12),
        legend.title = element_text(size=12),
        legend.text = element_text(size=10),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=25, family="serif"))


ggplot(dane, aes(x = flipper_length_mm, y = body_mass_g/1000, shape=species, color=species)) + geom_point() + geom_smooth(method=lm) +
  theme_minimal() + facet_grid(~sex) +
  ggtitle("Relationship between body weight and flipper length") +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  theme(axis.title.y = element_text(color="Grey23", size=15),
        axis.text.y = element_text(size=12),
        axis.title.x = element_text(color="Grey23", size=15),
        axis.text.x = element_text(size=12),
        legend.title = element_text(size=12),
        legend.text = element_text(size=10),
        legend.position = "right",
        legend.justification = c(0.94,0.94),
        legend.background = element_rect(fill="grey88",
                                         size=0.5, linetype="solid", 
                                         colour ="darkslateblue"),
        plot.title = element_text(color="darkslateblue", size=25, family="serif"))


Summary

Data storytelling is an important aspect of data analysis. The aim of the above project was to show how to improve the extraction of information using properly made visualizations. Adding colors, highlighting variants of quality features and adjusting the appropriate scale allows you to get to know the dataset much better.